The dataset provided gathers information from a telemarketing campaign run by a banking institution. Throughout the campaign, clients were contacted multiple times regarding a term deposit subscription offer.
The goal of this project is to predict whether a client will subscribe to the term deposit offer. To this end, a classification analysis was conducted. The term deposit target variable is labelled ‘y’ and has the possible outcomes ‘yes’ or ‘no’.
The first part of this report revolves around understanding the data. Here, the data quality is evaluated and some basic cleaning and pre-processing is undertaken to prepare the data for the baseline model, which will act as a benchmark for all later data transformations in the feature engineering process. Within this pipeline, pre-processing covers anything that can be done to get the data ready for a baseline model without requiring an EDA (Exploratory Data Analysis): loading the data, merging it, checking for NAs and missing values, dropping meaningless variables and correcting variable types.
First, we load the necessary train and test data.
train = fread("BankCamp_train.csv", stringsAsFactors = T)
test = fread("BankCamp_test.csv", stringsAsFactors = T)
Here we reorder the factor levels so that evaluation metrics treat the “yes” target as the positive class.
train$y <- factor(train$y, levels = c("yes", "no"))
test$y <- factor(test$y, levels = c("yes", "no"))
Finally, we convert the integer columns to numeric:
int_cols <- names(train)[sapply(train, is.integer)]
train[, (int_cols) := lapply(.SD, as.numeric), .SDcols = int_cols]
int_cols <- names(test)[sapply(test, is.integer)]
test[, (int_cols) := lapply(.SD, as.numeric), .SDcols = int_cols]
We explore the first few rows, structure and summary of the training data to see if there are any inconsistencies.
We find neither missing values nor duplicates in the set. However, the “pdays” variable contains ‘-1’ values, which we assume indicate that the information is not applicable to the client (i.e. no previous contact).
head(train)
age job marital education default balance housing loan
1: 50 entrepreneur married primary yes 537 yes no
2: 47 technician married secondary no -938 yes no
3: 56 housemaid married primary no 605 no no
4: 36 blue-collar married primary no 4608 yes no
5: 41 management married primary no 362 yes no
6: 32 student single tertiary no 0 no no
contact day month duration campaign pdays previous poutcome y
1: unknown 20 jun 11 15 -1 0 unknown no
2: unknown 28 may 176 2 -1 0 unknown no
3: cellular 19 aug 207 6 -1 0 unknown no
4: cellular 14 may 284 7 -1 0 unknown no
5: cellular 12 may 217 3 -1 0 unknown no
6: cellular 4 feb 233 3 276 2 failure yes
str(train)
Classes 'data.table' and 'data.frame': 36168 obs. of 17 variables:
$ age : num 50 47 56 36 41 32 26 60 39 55 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 3 10 4 2 5 9 9 2 8 1 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 2 1 1 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 1 1 1 3 2 1 2 2 ...
$ default : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
$ balance : num 537 -938 605 4608 362 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 1 1 1 1 1 2 1 3 ...
$ day : num 20 28 19 14 12 4 29 12 16 3 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 7 9 2 9 9 4 5 9 1 7 ...
$ duration : num 11 176 207 284 217 233 297 89 539 131 ...
$ campaign : num 15 2 6 7 3 3 1 2 1 1 ...
$ pdays : num -1 -1 -1 -1 -1 276 -1 -1 -1 -1 ...
$ previous : num 0 0 0 0 0 2 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 1 4 4 4 4 ...
$ y : Factor w/ 2 levels "yes","no": 2 2 2 2 2 1 2 2 2 2 ...
- attr(*, ".internal.selfref")=<externalptr>
summary(train)
age job marital education
Min. :18.00 blue-collar:7848 divorced: 4148 primary : 5517
1st Qu.:33.00 management :7512 married :21838 secondary:18557
Median :39.00 technician :6060 single :10182 tertiary :10598
Mean :40.96 admin. :4181 unknown : 1496
3rd Qu.:48.00 services :3286
Max. :95.00 retired :1826
(Other) :5455
default balance housing loan contact
no :35509 Min. :-8019 no :16034 no :30354 cellular :23408
yes: 659 1st Qu.: 72 yes:20134 yes: 5814 telephone: 2334
Median : 448 unknown :10426
Mean : 1360
3rd Qu.: 1428
Max. :98417
day month duration campaign
Min. : 1.00 may :11064 Min. : 0.0 Min. : 1.000
1st Qu.: 8.00 jul : 5551 1st Qu.: 103.0 1st Qu.: 1.000
Median :16.00 aug : 4945 Median : 180.0 Median : 2.000
Mean :15.81 jun : 4292 Mean : 258.8 Mean : 2.764
3rd Qu.:21.00 nov : 3175 3rd Qu.: 320.2 3rd Qu.: 3.000
Max. :31.00 apr : 2335 Max. :4918.0 Max. :63.000
(Other): 4806
pdays previous poutcome y
Min. : -1.00 Min. : 0.000 failure: 3942 yes: 4195
1st Qu.: -1.00 1st Qu.: 0.000 other : 1444 no :31973
Median : -1.00 Median : 0.000 success: 1198
Mean : 40.08 Mean : 0.574 unknown:29584
3rd Qu.: -1.00 3rd Qu.: 0.000
Max. :854.00 Max. :58.000
sapply(train, function(x) sum(is.na(x)))
age job marital education default balance housing
0 0 0 0 0 0 0
loan contact day month duration campaign pdays
0 0 0 0 0 0 0
previous poutcome y
0 0 0
any(duplicated(train))
[1] FALSE
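To avoid treating ‘-1’ in “pdays” as a magic number, the sentinel could be recoded as an explicit flag. A minimal base-R sketch on toy data (the names `contacted_before` and `pdays_clean` are illustrative, not part of the pipeline):

```r
# Toy vector mirroring the pdays sentinel: -1 appears to mean "not previously contacted"
pdays <- c(-1, -1, 276, 30, -1)

# Recode the sentinel as an explicit boolean flag
contacted_before <- pdays != -1

# Optionally treat the sentinel as missing for numeric summaries
pdays_clean <- ifelse(contacted_before, pdays, NA)

contacted_before
# [1] FALSE FALSE  TRUE  TRUE FALSE
```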
As a first step, we explore the distributions of the variables to gain insight into the data and inform decisions in the later feature engineering and modelling process.
sort(summary(train$y), dec=T)/nrow(train)
no yes
0.8840135 0.1159865
p0<-ggplot(train, aes(x=y))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+ labs(title = "Distribution of Clients Subscribed")+
xlab("y")+ ylab("Count")
p0
We see that the target variable is heavily unbalanced, with roughly 88% of the values denoting ‘no’. In the business context this is explained by the fact that the term deposit subscription is not targeted towards the majority of clients.
For later modelling and model evaluation this poses a challenge, as a model is likely to achieve high accuracy by simply training on and predicting the ‘no’ target, despite having a very low sensitivity score (namely being poor at correctly identifying true positive targets).
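The imbalance problem can be made concrete with a quick back-of-the-envelope check using the class counts from the summary above: a degenerate model that always answers ‘no’ already reaches high accuracy while identifying no positives at all.

```r
# Class counts taken from the training summary above
n_no  <- 31973
n_yes <- 4195

# A model that always predicts "no" scores high accuracy...
accuracy <- n_no / (n_no + n_yes)
round(accuracy, 3)
# [1] 0.884

# ...while never identifying a single "yes" case (sensitivity = 0)
sensitivity <- 0 / n_yes
```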
Next, a correlation plot is constructed to see if there is any significant correlation between numeric explanatory variables.
numerical <- select_if(train,is.numeric)
corr_plot <- ggcorr(numerical, label_round=2, label = TRUE)
corr_plot
As we can see, no pairs of variables are strongly correlated, apart from “previous” and “pdays”. We therefore do not need to spend much time transforming or dropping variables purely based on correlations between numeric explanatory variables.
Here is an interactive version of the correlation matrix.
plot_ly(x = colnames(numerical),
y = colnames(numerical),
z = cor(numerical), colors = colorRamp(c("dodgerblue4", "brown2")),
type = "heatmap") %>% layout(title="Interactive Correlation Matrix")
num_y <- train[, c('age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous','y')]
ggpairs(num_y[,1:8], title = "Distribution & Correlation graph", ggplot2::aes(colour=y, alpha=1/100))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the above graph we see how the numeric variables relate to each other and to the target variable. Along the diagonal we can see that there might be trends within the continuous distributions of the explanatory variables that align more with ‘yes’ or with ‘no’ outcomes; we can explore using these trends to bin variables in the feature engineering phase.
We also see that many of the variables are skewed, so we will consider normalizing the scales of the numeric variables at a later stage.
In this subsection, we explore the categorical variable distributions to inform later feature engineering decisions.
sort(summary(train$job), dec=T)/nrow(train)
blue-collar management technician admin. services
0.216987392 0.207697412 0.167551427 0.115599425 0.090853793
retired self-employed entrepreneur unemployed housemaid
0.050486618 0.035003318 0.032736120 0.028257023 0.027897589
student unknown
0.020874806 0.006055076
p1<-ggplot(train, aes(x=job))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=90))+ labs(title = "Distribution by Kind of Job")+
xlab("Job")+ ylab("Count")
train[, job:=factor(job, levels=names(sort(summary(train$job), dec=T)))]
levels(train$job)
[1] "blue-collar" "management" "technician" "admin."
[5] "services" "retired" "self-employed" "entrepreneur"
[9] "unemployed" "housemaid" "student" "unknown"
p1
We see that a large portion of potential targets is made up of blue collar and management related jobs.
sort(summary(train$marital), dec=T)/nrow(train)
married single divorced
0.6037934 0.2815196 0.1146870
p2<-ggplot(train, aes(x=marital))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+ labs(title = "Distribution by Marital Status")+
xlab("Marital Status")+ ylab("Count")
train[, marital:=factor(marital, levels=names(sort(summary(train$marital), dec=T)))]
levels(train$marital)
[1] "married" "single" "divorced"
p2
Most of the historic potential targets were married; only a small fraction were divorced.
sort(summary(train$education), dec=T)/nrow(train)
secondary tertiary primary unknown
0.51307786 0.29302146 0.15253816 0.04136253
p3<-ggplot(train, aes(x=education))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+labs(title = "Distribution by Education Level")+
xlab("Education Level")+ ylab("Count")
train[, education:=factor(education, levels=names(sort(summary(train$education), dec=T)))]
levels(train$education)
[1] "secondary" "tertiary" "primary" "unknown"
p3
Most historic potential targets have at least secondary education, making them good candidates from a business point of view.
sort(summary(train$month), dec=T)/nrow(train)
may jul aug jun nov apr
0.30590577 0.15347821 0.13672307 0.11866844 0.08778478 0.06455983
feb jan oct sep mar dec
0.05786883 0.03121544 0.01645101 0.01205486 0.01064477 0.00464499
train$month <- factor(train$month, levels = c( "jan", "feb", "mar", "apr","may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))
p4<-ggplot(train, aes(x=month))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=90))+labs(title = "Distribution by Month")+
xlab("Month")+ ylab("Count")
levels(train$month)
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep" "oct" "nov"
[12] "dec"
p4
We see a spike of contacts in May, sustained throughout the summer and dropping during the rest of the year.
sort(summary(train$loan), dec=T)/nrow(train)
no yes
0.8392502 0.1607498
p5<-ggplot(train, aes(x=loan))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+labs(title = "Loans Acquired")+
xlab("Loans")+ ylab("Count")
p5
We see that the majority of potential targets do not have a loan.
sort(summary(train$housing), dec=T)/nrow(train)
yes no
0.5566799 0.4433201
p6<-ggplot(train, aes(x=housing))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+labs(title = "Housing Loan Acquired")+
xlab("Housing Loan")+ ylab("Count")
p6
Housing loans, however, are more evenly balanced.
sort(summary(train$contact), dec=T)/nrow(train)
cellular unknown telephone
0.64720195 0.28826587 0.06453218
p7<-ggplot(train, aes(x=contact))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+labs(title = "Contact Type")+
xlab("Contact")+ ylab("Count")
p7
Most potential applicants can be reached by cellular phone; for a significant portion, however, the contact type is unknown.
sort(summary(train$poutcome), dec=T)/nrow(train)
unknown failure other success
0.8179606 0.1089914 0.0399248 0.0331232
p8<-ggplot(train, aes(x=poutcome))+geom_bar(stat='count',fill="dodgerblue4", colour="dodgerblue4")+
theme(axis.text.x = element_text(angle=0))+labs(title = "Previous Campaign Outcome")+
xlab("Outcome")+ ylab("Count")
p8
Most values of the previous campaign outcome are unknown; for this reason the variable may have low explanatory power for the model.
In the bivariate analysis, we look at the proportion of clients who subscribed versus those who did not, broken down by each variable. We further explore trends amongst the explanatory variables to inform further feature creation.
p9<-ggplot(train,
aes(x = marital,
fill = y)) +
geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+
labs(title = "Term Deposit Subscription by Marital Status")+xlab("Marital Status")+ ylab("Count")
p9
We see that the proportion of married clients subscribing to a term deposit is smaller than that of single or divorced clients.
p10<-ggplot(train,
aes(x = education,
fill = y)) +
geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+
labs(title = "Term Deposit Subscription by Education Level")+xlab("Education Level")+ ylab("Count")
p10
Clients with tertiary education seem more inclined to subscribe to a term deposit than those with primary or secondary education.
p11<-ggplot(train,
aes(x = job,
fill = y)) +
geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+ theme(axis.text.x = element_text(angle=90))+
labs(title = "Term Deposit Subscription by Job")+xlab("Job")+ ylab("Count")
p11
It seems that students subscribe to a term deposit more often than working or retired clients.
p12<-ggplot(train,
aes(x = age,
fill = y)) +
geom_bar(position = "dodge")+ scale_fill_manual(values=c("dodgerblue3", "dodgerblue4"))+ theme(axis.text.x = element_text(angle=90))+
labs(title = "Term Deposit Subscription by Age")+xlab("Age")+ ylab("Count")
p12
It seems that middle-aged clients, roughly 30 to 35 years old, subscribe to a term deposit more often than younger and older clients.
In the following section we analyze the distribution of the numerical explanatory variables according to their response to the term deposit subscription.
p13<-ggplot(train, aes(x=as.factor(y), y=duration)) +
geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
theme_classic()+labs(title = "Call Duration by Term Deposit Subscription")+xlab("Term deposit")+
ylab("Call Duration (sec.)")
p13
p14<-ggplot(train, aes(x=as.factor(y), y=pdays)) +
geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
theme_classic()+labs(title = "Days from last contact by Term Deposit Subscription")+xlab("Term deposit")+
ylab("Days from last contact")
p14
p15<-ggplot(train, aes(x=as.factor(y), y=previous)) +
geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
theme_classic()+labs(title = "Contacts before this Campaign by Term Deposit Subscription")+xlab("Term deposit")+
ylab("Contacts before this campaign")
p15
p16<-ggplot(train, aes(x=as.factor(y), y=campaign)) +
geom_boxplot(fill="dodgerblue3", color="dodgerblue4")+
theme_classic()+labs(title = "Contacts during this Campaign by Term Deposit Subscription")+xlab("Term deposit")+
ylab("Contacts during this campaign")
p16
The coefficient of variation is a dimensionless measure of dispersion: the lower the value, the less dispersion a feature has.
numeric_variables<-names(numerical)
sd_numeric_variables<-sapply(train[,numeric_variables, with=F], sd)
cv_numeric_variables<-sd_numeric_variables/colMeans(train[,numeric_variables, with=F])
ggplot(data.table(var=names(cv_numeric_variables),cv=cv_numeric_variables),
aes(var,fill=cv))+geom_bar()+coord_polar()+scale_fill_gradient(low='white', high = 'dodgerblue4')
We check for variables with a coefficient of variation below 0.05; there are none.
cv_numeric_variables[cv_numeric_variables < 0.05]
named numeric(0)
Caret can find sets of linearly dependent columns to remove; here there are none.
lc<-findLinearCombos(train[, ..numeric_variables])
lc
$linearCombos
list()
$remove
NULL
To reduce the weight that certain features would receive purely due to their scale, we center and scale them. We also reduce skewness by performing a Yeo-Johnson transformation, which, unlike Box-Cox, can handle negative values. This combination is applied during each model training.
pp<-c("center", "scale", "YeoJohnson")
preProcess(train, method =pp)
Created from 36168 samples and 17 variables
Pre-processing:
- centered (7)
- ignored (10)
- scaled (7)
- Yeo-Johnson transformation (6)
Lambda estimates for Yeo-Johnson transformation:
-0.19, 0.84, 0.69, 0.13, -1.08, -0.6
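Note that `preProcess()` only estimates the transformation; the transformed data are produced by calling `predict()` on the resulting object (caret also applies it automatically when the `preProcess=` argument is passed to `train()`, as done later). As a base-R sketch of what “center” and “scale” do on a toy vector:

```r
# Centering subtracts the mean, scaling divides by the standard deviation
x <- c(10, 20, 30, 40)
x_cs <- (x - mean(x)) / sd(x)
round(x_cs, 3)
# [1] -1.162 -0.387  0.387  1.162

# The result has mean 0 and standard deviation 1; base R's scale(x) is equivalent
```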
We dummy encode the factor variables to be able to run different models on the data.
dummy<-dummyVars(formula= ~., data = train[, -"y"], sep = "_")
final_train<-data.table(predict(dummy, newdata = train[, -"y"]))
final_train$y<-train$y
final_test<-data.table(predict(dummy, newdata = test))
Now we define our training function. It performs 5-fold cross-validation on 80% of the training data, followed by a final validation on the remaining 20%, which acts as a holdout. It also applies the aforementioned preprocessing.
train_val <- function(train_dt, model, sampling){
  tc <- trainControl(
    method = "cv",
    number = 5,
    savePredictions = TRUE,
    classProbs = TRUE,
    summaryFunction = prSummary)
  trainIndex <- createDataPartition(train_dt$y, p = .8, list = FALSE)
  model_train <- train_dt[trainIndex, ]
  holdout <- train_dt[-trainIndex, ]
  if(!missing(sampling)){
    if(sampling == 'over'){
      model_train <- upSample(x = model_train[, -"y"], y = model_train$y, yname = "y")
    } else if(sampling == 'under'){
      model_train <- downSample(x = model_train[, -"y"], y = model_train$y, yname = "y")
    } else {
      model_train <- SMOTE(y ~ ., data = model_train)
    }
  }
  ini <- now()
  model <- train(y ~ ., data = model_train, method = model, metric = "Recall",
                 trControl = tc, preProcess = pp)
  message("Cross Validation Scores having Yes as the positive class")
  print(model$results)  # print() keeps the data.frame layout readable
  predicted <- predict(model, newdata = holdout)
  message("Train + Predict time:")
  print(now() - ini)
  print("Holdout Scores")
  print(confusionMatrix(table(predicted, holdout$y), positive = "yes", mode = "everything"))
  return(model)
}
Now we run a simple logistic regression as a baseline; this will serve as a way of knowing whether our feature engineering steps improve performance.
lm <- train_val(final_train, "glm")
Cross Validation Scores having Yes as the positive class
  parameter       AUC Precision    Recall         F      AUCSD PrecisionSD   RecallSD        FSD
1      none 0.5530775 0.6436530 0.3468428 0.4507432 0.01498835  0.01689835 0.01431031 0.01595086
Train + Predict time:
1.29214898347855
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 283 140
no 556 6254
Accuracy : 0.9038
95% CI : (0.8967, 0.9105)
No Information Rate : 0.884
P-Value [Acc > NIR] : 4.007e-08
Kappa : 0.402
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.33731
Specificity : 0.97810
Pos Pred Value : 0.66903
Neg Pred Value : 0.91836
Precision : 0.66903
Recall : 0.33731
F1 : 0.44849
Prevalence : 0.11600
Detection Rate : 0.03913
Detection Prevalence : 0.05848
Balanced Accuracy : 0.65771
'Positive' Class : yes
So our Sensitivity, or Recall, on the “yes” class is very low, meaning that we correctly classify only a small portion of the “yes” cases.
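The holdout metrics above can be reproduced by hand from the confusion matrix, which makes explicit why accuracy looks good while recall is poor:

```r
# Cells of the holdout confusion matrix above ("yes" is the positive class)
TP <- 283   # predicted yes, actually yes
FP <- 140   # predicted yes, actually no
FN <- 556   # predicted no, actually yes
TN <- 6254  # predicted no, actually no

recall    <- TP / (TP + FN)   # sensitivity on "yes"
precision <- TP / (TP + FP)   # positive predictive value
f1        <- 2 * precision * recall / (precision + recall)

round(c(recall = recall, precision = precision, f1 = f1), 5)
#    recall precision        f1
#   0.33731   0.66903   0.44849
```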
Here we merge the two datasets to perform multiple feature engineering steps.
y <- final_train$y
final_train$y <- NULL
merged_df <- rbind(final_train, final_test)
We create a variable that measures the days to the end of the month (approximating each month as 30 days), to see whether the date contacted could have any explanatory value for our prediction.
merged_df$days_to_end_of_month <- 30 - merged_df$day
We booleanize the ‘balance’ variable to make the distinction between having a positive balance and not having one more explicit.
# 1 if the balance is positive, 0 otherwise (vectorized)
merged_df$balance_general_status <- as.numeric(merged_df$balance > 0)
We create a loan-count variable, since having one or more loans can make a difference in a potential client’s decision.
merged_df$quantity_loans <- merged_df$housing_yes+merged_df$loan_yes
We bin days into weeks of the month, in case there is a correlation between the target and the week in which the client was contacted.
week_type <- function(day){
  week_num <- 0
  if(day < 7){
    week_num <- 1
  } else if(day < 15){
    week_num <- 2
  } else if(day < 22){
    week_num <- 3
  } else if(day < 30){
    week_num <- 4
  } else {
    week_num <- 5
  }
  return(week_num)
}
merged_df$week <- as.factor(as.numeric(lapply(merged_df$day, FUN = week_type)))
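The same binning can also be done without a per-element function call: `findInterval()` (base R) applies identical cut points in one vectorized step. A sketch on toy data, mirroring `week_type()` above:

```r
# Vectorized equivalent of week_type():
# 1..6 -> 1, 7..14 -> 2, 15..21 -> 3, 22..29 -> 4, 30..31 -> 5
day  <- c(1, 6, 7, 14, 15, 21, 22, 29, 30, 31)
week <- findInterval(day, c(7, 15, 22, 30)) + 1L
week
# [1] 1 1 2 2 3 3 4 4 5 5
```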
dmy<-dummyVars("~.",data = merged_df)
merged_df<-data.table(predict(dmy, newdata = merged_df))
In the EDA we have seen that trends differ across the year. We therefore bin the months into quarters.
merged_df$Q1 <- merged_df$month_jan+merged_df$month_feb+merged_df$month_mar
merged_df$Q2 <- merged_df$month_apr+merged_df$month_may+merged_df$month_jun
merged_df$Q3 <- merged_df$month_jul+merged_df$month_aug+merged_df$month_sep
merged_df$Q4 <- merged_df$month_oct+merged_df$month_nov+merged_df$month_dec
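Because the months are already dummy-encoded here, the quarters are built by summing dummies; had the raw month number (1–12) been kept, the same mapping could be computed arithmetically. A small sketch (the `month_num` variable is illustrative):

```r
# Quarter from a month number: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
month_num <- 1:12
quarter   <- (month_num - 1L) %/% 3L + 1L
quarter
# [1] 1 1 1 2 2 2 3 3 3 4 4 4
```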
train_featured <- merged_df[1:nrow(train),]
train_featured$y <- y
test_featured <- merged_df[(nrow(train)+1):nrow(merged_df),]
Here we define a number of models, to see which one performs best.
lm_over <- train_val(train_featured, "glm","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 691 952
no 148 5442
Accuracy : 0.8479
95% CI : (0.8394, 0.8561)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4764
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.82360
Specificity : 0.85111
Pos Pred Value : 0.42057
Neg Pred Value : 0.97352
Precision : 0.42057
Recall : 0.82360
F1 : 0.55681
Prevalence : 0.11600
Detection Rate : 0.09553
Detection Prevalence : 0.22715
Balanced Accuracy : 0.83735
'Positive' Class : yes
lm_under <- train_val(train_featured, "glm","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 695 959
no 144 5435
Accuracy : 0.8475
95% CI : (0.839, 0.8557)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4771
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.82837
Specificity : 0.85002
Pos Pred Value : 0.42019
Neg Pred Value : 0.97419
Precision : 0.42019
Recall : 0.82837
F1 : 0.55756
Prevalence : 0.11600
Detection Rate : 0.09609
Detection Prevalence : 0.22867
Balanced Accuracy : 0.83919
'Positive' Class : yes
lm_S <- train_val(train_featured, "glm","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 621 785
no 218 5609
Accuracy : 0.8613
95% CI : (0.8532, 0.8692)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4773
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.74017
Specificity : 0.87723
Pos Pred Value : 0.44168
Neg Pred Value : 0.96259
Precision : 0.44168
Recall : 0.74017
F1 : 0.55323
Prevalence : 0.11600
Detection Rate : 0.08586
Detection Prevalence : 0.19439
Balanced Accuracy : 0.80870
'Positive' Class : yes
rf_over <- train_val(train_featured, "ranger","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 454 366
no 385 6028
Accuracy : 0.8962
95% CI : (0.8889, 0.9031)
No Information Rate : 0.884
P-Value [Acc > NIR] : 0.0005591
Kappa : 0.4887
Mcnemar's Test P-Value : 0.5112907
Sensitivity : 0.54112
Specificity : 0.94276
Pos Pred Value : 0.55366
Neg Pred Value : 0.93997
Precision : 0.55366
Recall : 0.54112
F1 : 0.54732
Prevalence : 0.11600
Detection Rate : 0.06277
Detection Prevalence : 0.11337
Balanced Accuracy : 0.74194
'Positive' Class : yes
rf_under <- train_val(train_featured, "ranger","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 755 1020
no 84 5374
Accuracy : 0.8474
95% CI : (0.8389, 0.8556)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4987
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8999
Specificity : 0.8405
Pos Pred Value : 0.4254
Neg Pred Value : 0.9846
Precision : 0.4254
Recall : 0.8999
F1 : 0.5777
Prevalence : 0.1160
Detection Rate : 0.1044
Detection Prevalence : 0.2454
Balanced Accuracy : 0.8702
'Positive' Class : yes
rf_S <- train_val(train_featured, "ranger","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 646 710
no 193 5684
Accuracy : 0.8752
95% CI : (0.8673, 0.8827)
No Information Rate : 0.884
P-Value [Acc > NIR] : 0.9905
Kappa : 0.5198
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.76996
Specificity : 0.88896
Pos Pred Value : 0.47640
Neg Pred Value : 0.96716
Precision : 0.47640
Recall : 0.76996
F1 : 0.58861
Prevalence : 0.11600
Detection Rate : 0.08931
Detection Prevalence : 0.18747
Balanced Accuracy : 0.82946
'Positive' Class : yes
xgb_over <- train_val(train_featured, "xgbTree","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 735 891
no 104 5503
Accuracy : 0.8624
95% CI : (0.8543, 0.8703)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.5234
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8760
Specificity : 0.8607
Pos Pred Value : 0.4520
Neg Pred Value : 0.9815
Precision : 0.4520
Recall : 0.8760
F1 : 0.5963
Prevalence : 0.1160
Detection Rate : 0.1016
Detection Prevalence : 0.2248
Balanced Accuracy : 0.8683
'Positive' Class : yes
xgb_under <- train_val(train_featured, "xgbTree","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 739 1023
no 100 5371
Accuracy : 0.8447
95% CI : (0.8362, 0.853)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4877
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8808
Specificity : 0.8400
Pos Pred Value : 0.4194
Neg Pred Value : 0.9817
Precision : 0.4194
Recall : 0.8808
F1 : 0.5682
Prevalence : 0.1160
Detection Rate : 0.1022
Detection Prevalence : 0.2436
Balanced Accuracy : 0.8604
'Positive' Class : yes
xgb_S <- train_val(train_featured, "xgbTree","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 615 477
no 224 5917
Accuracy : 0.9031
95% CI : (0.896, 0.9098)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1.15e-07
Kappa : 0.5822
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.73302
Specificity : 0.92540
Pos Pred Value : 0.56319
Neg Pred Value : 0.96352
Precision : 0.56319
Recall : 0.73302
F1 : 0.63698
Prevalence : 0.11600
Detection Rate : 0.08503
Detection Prevalence : 0.15097
Balanced Accuracy : 0.82921
'Positive' Class : yes
glmnet_over <- train_val(train_featured, "glmnet","over")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 693 946
no 146 5448
Accuracy : 0.849
95% CI : (0.8406, 0.8572)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4794
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.82598
Specificity : 0.85205
Pos Pred Value : 0.42282
Neg Pred Value : 0.97390
Precision : 0.42282
Recall : 0.82598
F1 : 0.55932
Prevalence : 0.11600
Detection Rate : 0.09581
Detection Prevalence : 0.22660
Balanced Accuracy : 0.83902
'Positive' Class : yes
glmnet_under <- train_val(train_featured, "glmnet","under")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 682 885
no 157 5509
Accuracy : 0.8559
95% CI : (0.8476, 0.864)
No Information Rate : 0.884
P-Value [Acc > NIR] : 1
Kappa : 0.4898
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.81287
Specificity : 0.86159
Pos Pred Value : 0.43523
Neg Pred Value : 0.97229
Precision : 0.43523
Recall : 0.81287
F1 : 0.56692
Prevalence : 0.11600
Detection Rate : 0.09429
Detection Prevalence : 0.21665
Balanced Accuracy : 0.83723
'Positive' Class : yes
glmnet_S <- train_val(train_featured, "glmnet","SMOTE")
[1] "Holdout Scores"
Confusion Matrix and Statistics
predicted yes no
yes 654 736
no 185 5658
Accuracy : 0.8727
95% CI : (0.8648, 0.8803)
No Information Rate : 0.884
P-Value [Acc > NIR] : 0.9986
Kappa : 0.5169
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.77950
Specificity : 0.88489
Pos Pred Value : 0.47050
Neg Pred Value : 0.96834
Precision : 0.47050
Recall : 0.77950
F1 : 0.58681
Prevalence : 0.11600
Detection Rate : 0.09042
Detection Prevalence : 0.19217
Balanced Accuracy : 0.83220
'Positive' Class : yes
We print the results to compare the different models.
results <- resamples(list(
LMO = lm_over,
LMU = lm_under,
LMS = lm_S,
RFO = rf_over,
RFU = rf_under,
RFS = rf_S,
XGBO = xgb_over,
XGBU = xgb_under,
XGBS = xgb_S,
GLO = glmnet_over,
GLU = glmnet_under,
GLS = glmnet_S)
)
summary(results)
Call:
summary.resamples(object = results)
Models: LMO, LMU, LMS, RFO, RFU, RFS, XGBO, XGBU, XGBS, GLO, GLU, GLS
Number of resamples: 5
AUC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LMO 0.8878020 0.8903696 0.8909167 0.8909165 0.8920046 0.8934896 0
LMU 0.8855870 0.8885111 0.8945513 0.8946716 0.9015525 0.9031563 0
LMS 0.8623896 0.8625047 0.8787070 0.8737488 0.8810630 0.8840797 0
RFO 0.4978557 0.5001908 0.5004142 0.5050155 0.5073429 0.5192739 0
RFU 0.8829104 0.8888379 0.8899093 0.8938761 0.8939620 0.9137610 0
RFS 0.9613913 0.9624824 0.9626484 0.9637817 0.9635713 0.9688150 0
XGBO 0.9306198 0.9309255 0.9321529 0.9325274 0.9324328 0.9365060 0
XGBU 0.8891333 0.8908632 0.8981468 0.8963347 0.9008828 0.9026474 0
XGBS 0.9684089 0.9687097 0.9692926 0.9696295 0.9706853 0.9710509 0
GLO 0.8915520 0.8945663 0.8954650 0.8950614 0.8959636 0.8977602 0
GLU 0.8756316 0.8848797 0.8882006 0.8894145 0.8980499 0.9003107 0
GLS 0.8574968 0.8665558 0.8673677 0.8689898 0.8755728 0.8779559 0
F
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LMO 0.8332179 0.8350313 0.8368093 0.8363630 0.8380087 0.8387479 0
LMU 0.8111782 0.8230828 0.8240602 0.8310036 0.8438438 0.8528529 0
LMS 0.8155194 0.8162025 0.8166794 0.8215625 0.8274994 0.8319117 0
RFO 0.9659992 0.9664493 0.9664620 0.9675197 0.9687441 0.9699441 0
RFU 0.8605974 0.8622927 0.8625455 0.8663620 0.8701299 0.8762447 0
RFS 0.9269240 0.9276760 0.9288317 0.9313133 0.9356952 0.9374394 0
XGBO 0.8969868 0.8983051 0.9020127 0.9010425 0.9022016 0.9057065 0
XGBU 0.8554996 0.8563135 0.8659341 0.8629170 0.8673993 0.8694384 0
XGBS 0.8960483 0.8968096 0.9016270 0.9005217 0.9026680 0.9054554 0
GLO 0.8364216 0.8382878 0.8387609 0.8389365 0.8402494 0.8409629 0
GLU 0.8199697 0.8238806 0.8315630 0.8297333 0.8337096 0.8395437 0
GLS 0.8147405 0.8157236 0.8169869 0.8175628 0.8184802 0.8218830 0
Precision
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LMO 0.8369992 0.8439944 0.8464307 0.8443671 0.8470447 0.8473663 0
LMU 0.8223583 0.8328267 0.8390093 0.8407451 0.8502269 0.8593041 0
LMS 0.8223120 0.8322148 0.8368942 0.8342632 0.8395503 0.8403447 0
RFO 0.9345760 0.9352715 0.9354190 0.9373620 0.9397280 0.9418155 0
RFU 0.8242507 0.8351955 0.8380952 0.8366455 0.8423295 0.8433566 0
RFS 0.9015933 0.9068884 0.9087452 0.9095903 0.9150451 0.9156798 0
XGBO 0.8653706 0.8664850 0.8724869 0.8701277 0.8730565 0.8732394 0
XGBU 0.8263889 0.8345120 0.8515850 0.8436317 0.8526466 0.8530259 0
XGBS 0.8982301 0.8999019 0.9004004 0.9031675 0.9086269 0.9086781 0
GLO 0.8374757 0.8490040 0.8499096 0.8483190 0.8523002 0.8529053 0
GLU 0.8263473 0.8325653 0.8419453 0.8409974 0.8469861 0.8571429 0
GLS 0.8223120 0.8227848 0.8321318 0.8304507 0.8325629 0.8424622 0
Recall
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
LMO 0.8227131 0.8274042 0.8291634 0.8285313 0.8303030 0.8330727 0
LMU 0.8002981 0.8077496 0.8154762 0.8215155 0.8375559 0.8464978 0
LMS 0.7974181 0.8007948 0.8088381 0.8092974 0.8157895 0.8236463 0
RFO 0.9996090 0.9996091 0.9996091 0.9996872 0.9998045 0.9998045 0
RFU 0.8837556 0.8912072 0.8986587 0.8983904 0.9002976 0.9180328 0
RFS 0.9478649 0.9498261 0.9553128 0.9541122 0.9572989 0.9602583 0
XGBO 0.9310008 0.9325513 0.9329554 0.9342429 0.9331509 0.9415559 0
XGBU 0.8792846 0.8807750 0.8822653 0.8831932 0.8867362 0.8869048 0
XGBS 0.8838133 0.8932473 0.8946846 0.8979946 0.9071500 0.9110780 0
GLO 0.8205629 0.8256450 0.8269795 0.8298603 0.8330727 0.8430414 0
GLU 0.8077496 0.8166915 0.8214286 0.8188312 0.8226528 0.8256334 0
GLS 0.8022851 0.8023833 0.8048659 0.8051253 0.8068520 0.8092399 0
We plot the results to better understand which model performs best.
bwplot(results, layout = c(2, 1), scales = list(relation="free"))
Based on this, we see that the best-performing model is the Random Forest with SMOTE sampling.
Finally, we perform a cross-validated grid search on the chosen Random Forest in order to define the final model, optimizing for Recall.
grid <- expand.grid(
  mtry = c(16, 32, 63),
  splitrule = c("extratrees", "gini"),
  min.node.size = c(1, 3, 5)
)
tc <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  summaryFunction = prSummary,
  classProbs = TRUE,
  search = "random"
)
ini <- now()
trainIndex <- createDataPartition(train_featured$y, p = .95, list = FALSE)
model_train <- train_featured[trainIndex, ]
holdout <- train_featured[-trainIndex, ]
model_train <- SMOTE(y ~ ., data = model_train)
grid_rf <- train(
y ~ .,
data = model_train,
method = "ranger",
num.trees=500,
tuneGrid = grid,
trControl = tc,
metric = "Recall",
tuneLength = 5,
verbose = TRUE,
importance = "impurity",
preProcess=pp
)
print(now()-ini)
Time difference of 47.0147 mins
grid_rf
Random Forest
27902 samples
63 predictor
2 classes: 'yes', 'no'
Pre-processing: centered (63), scaled (63), Yeo-Johnson transformation (63)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 22323, 22321, 22322, 22321, 22321
Resampling results across tuning parameters:
mtry  splitrule   min.node.size  AUC        Precision  Recall     F
16    extratrees  1              0.9739380  0.9133025  0.9504094  0.9314820
16    extratrees  3              0.9737242  0.9127051  0.9493226  0.9306476
16    extratrees  5              0.9714476  0.9081152  0.9462284  0.9267747
16    gini        1              0.9705882  0.9087755  0.9453925  0.9267183
16    gini        3              0.9709983  0.9082192  0.9448070  0.9261436
16    gini        5              0.9709856  0.9080662  0.9414623  0.9244547
32    extratrees  1              0.9663423  0.9062043  0.9514970  0.9282885
32    extratrees  3              0.9693028  0.9064545  0.9503262  0.9278635
32    extratrees  5              0.9682875  0.9027077  0.9487373  0.9251417
32    gini        1              0.9612021  0.9033966  0.9422148  0.9223890
32    gini        3              0.9630954  0.9037762  0.9415456  0.9222649
32    gini        5              0.9646612  0.9026009  0.9398731  0.9208522
63    extratrees  1              0.9601284  0.9022918  0.9487370  0.9249180
63    extratrees  3              0.9622551  0.9029146  0.9486536  0.9252099
63    extratrees  5              0.9632128  0.8999184  0.9457268  0.9222444
63    gini        1              0.9331448  0.9010824  0.9312595  0.9159052
63    gini        3              0.9356459  0.9008694  0.9298382  0.9151051
63    gini        5              0.9396563  0.9006608  0.9298383  0.9149929
Recall was used to select the optimal model using the largest value.
The final values used for the model were mtry = 32, splitrule =
extratrees and min.node.size = 1.
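The reported selection can be cross-checked by hand: transcribing the Recall column of the tuning table above into a small data frame and taking the row with the largest value (a base-R sketch, independent of the fitted `grid_rf` object):

```r
# Tuning grid and mean cross-validated Recall, transcribed from the table above
res <- data.frame(
  mtry          = rep(c(16, 32, 63), each = 6),
  splitrule     = rep(rep(c("extratrees", "gini"), each = 3), times = 3),
  min.node.size = rep(c(1, 3, 5), times = 6),
  Recall        = c(0.9504094, 0.9493226, 0.9462284, 0.9453925, 0.9448070, 0.9414623,
                    0.9514970, 0.9503262, 0.9487373, 0.9422148, 0.9415456, 0.9398731,
                    0.9487370, 0.9486536, 0.9457268, 0.9312595, 0.9298382, 0.9298383)
)
# Row with the largest Recall: mtry = 32, splitrule = extratrees, min.node.size = 1
res[which.max(res$Recall), ]
```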
And this is the final model:
grid_rf$finalModel
Ranger result
Call:
ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
Type: Probability estimation
Number of trees: 500
Sample size: 27902
Number of independent variables: 63
Mtry: 32
Target node size: 1
Variable importance mode: impurity
Splitrule: extratrees
OOB prediction error (Brier s.): 0.04479533
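The OOB Brier score reported above is the mean squared difference between the predicted class probability and the observed 0/1 outcome; a tiny worked example with hypothetical values:

```r
# Hypothetical predicted probabilities of the positive class and actual outcomes
p   <- c(0.9, 0.2, 0.7, 0.1)
obs <- c(1, 0, 1, 0)
# Brier score: mean squared error of the probability forecasts
mean((p - obs)^2)  # 0.0375
```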
Here we can see the Recall of the model across the grid search process.
plot(grid_rf)
Here we see the variable importance for the final model.
plot(varImp(grid_rf), top = 20)
predicted <- predict(grid_rf, newdata = holdout)
roc_curve<-roc(response=as.numeric(factor(holdout$y == "yes")), predictor=as.numeric(factor(predicted == "yes")))
Setting levels: control = 1, case = 2
Setting direction: controls < cases
plot.roc(roc_curve, main='ROC on Holdout');grid()
print(confusionMatrix(table(predicted, holdout$y), positive="yes", mode="everything"))
Confusion Matrix and Statistics
predicted  yes   no
      yes  170  174
      no    39 1424
Accuracy : 0.8821
95% CI : (0.8664, 0.8966)
No Information Rate : 0.8843
P-Value [Acc > NIR] : 0.6328
Kappa : 0.5501
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.81340
Specificity : 0.89111
Pos Pred Value : 0.49419
Neg Pred Value : 0.97334
Precision : 0.49419
Recall : 0.81340
F1 : 0.61483
Prevalence : 0.11566
Detection Rate : 0.09408
Detection Prevalence : 0.19037
Balanced Accuracy : 0.85226
'Positive' Class : yes
plot(table(predicted=predicted, real=holdout$y))
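The reported metrics follow directly from the four cells of the confusion matrix above; a quick hand-check:

```r
# Cells taken from the holdout confusion matrix above
tp <- 170; fp <- 174; fn <- 39; tn <- 1424
precision <- tp / (tp + fp)                    # 0.49419 (Pos Pred Value)
recall    <- tp / (tp + fn)                    # 0.81340 (Sensitivity)
f1        <- 2 * precision * recall / (precision + recall)  # 0.61483
accuracy  <- (tp + tn) / (tp + fp + fn + tn)   # 0.8821
```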
We predict on the test set, add the predictions to it and create our predictions file.
predicted <- predict(grid_rf, newdata = test_featured)
df_pred<-cbind(test, predicted)
fwrite(df_pred[,c("predicted")],"predictions.csv")
We conclude that, after extensive feature exploration, feature engineering, and modelling, the best-performing model is a Random Forest with SMOTE sampling and the parameters shown above. The final sensitivity score on a 5% holdout set is 0.81.
The following are the main considerations and conclusions that we can extract from each part of the process:
The initial dataset was highly imbalanced: for our target variable (term deposit subscription), 88% of the labels correspond to class NO. Considering this, we tested different sampling techniques (oversampling, undersampling and SMOTE) to understand which one performs better.
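As a toy base-R illustration of the two resampling-based balancing ideas (SMOTE additionally synthesizes new minority examples, e.g. via the DMwR package; the counts here are hypothetical, not taken from the dataset):

```r
set.seed(42)
# Toy imbalanced target: 880 "no" vs 120 "yes"
y <- factor(c(rep("no", 880), rep("yes", 120)))
minority <- which(y == "yes")
majority <- which(y == "no")
# Oversampling: resample the minority class (with replacement) up to the majority size
over_idx <- c(majority, sample(minority, length(majority), replace = TRUE))
# Undersampling: sample the majority class down to the minority size
under_idx <- c(minority, sample(majority, length(minority)))
table(y[over_idx])   # balanced: 880 / 880
table(y[under_idx])  # balanced: 120 / 120
```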
A Shiny dashboard was built to present some plots and basic statistics, making the data easier to explore. R provides this kind of tool for handling all kinds of data in a more visual and interactive way.
We created a set of additional features based on the initial dataset. Of these new features, Days to end of month, Quantity of loans, Q3 and Balance General Status, among others, proved relevant in the final model.
Considering that the Random Forest algorithm performs feature selection implicitly, we can see from the variable importance plot above that the most influential variable is duration.
Our final feature set remained unchanged because there were no highly significant correlations, near-zero-variance features or linear dependencies among the features.
We assume that a false negative carries a higher cost than a false positive; in this case we therefore base our results on the sensitivity metric (true positive rate).
After running a Logistic Regression as a baseline model and including the new features, we tested 4 different classification techniques.
The best Recall score came from the Random Forest model, so a cross-validated grid search was applied in order to define the final model and optimize Recall. Different tuning parameters were tested, with the best outcome given by the following ones:
* ‘mtry’ = 32
* ‘splitrule’ = extratrees
* ‘min.node.size’ = 1
The results of the final model were presented with an ROC curve and a plotted confusion matrix.
Finally, the target variable was predicted on the test set and a .csv file was created to store the predictions.